Abstract

Mendelian randomization (MR) is a widely used method to estimate the causal effect of an exposure on an outcome by using genetic variants as instrumental variables. MR analyses that use variants from only a single genetic region (cis-MR) encoding the protein target of a drug are able to provide supporting evidence for drug target validation. This paper proposes methods for cis-MR inference that use many correlated variants to make robust inferences even in situations, where those variants have only weak effects on the exposure. In particular, we exploit the highly structured nature of genetic correlations in single gene regions to reduce the dimension of genetic variants using factor analysis. These genetic factors are then used as instrumental variables to construct tests for the causal effect of interest. Since these factors may often be weakly associated with the exposure, size distortions of standard t-tests can be severe. Therefore, we consider two approaches based on conditional testing. First, we extend results of commonly-used identification-robust tests for the setting where estimated factors are used as instruments. Second, we propose a test which appropriately adjusts for first-stage screening of genetic factors based on their relevance. Our empirical results provide genetic evidence to validate cholesterol-lowering drug targets aimed at preventing coronary heart disease.

1. Introduction

Mendelian randomization (MR) is a widely-used method to estimate the causal effect of an exposure on an outcome by using genetic variants as instrumental variables. An emerging area of clinical research concerns MR studies that use genetic variants from only single genetic regions of pharmacological interest (cis-MR). Compared with polygenic MR where variants may be chosen from multiple gene regions, cis-MR is often used when the exposure of interest is regulated by a specific gene, such as the coding gene for an exposure that is a protein. For such analyses, the cis-MR design has been suggested to be less prone to confounding from pleiotropy, where variants may be associated with the outcome through their effects on traits other than the exposure (Swerdlow et al., 2016).

Moreover, the potential effect of a drug can be investigated by an MR analysis of a genomic locus encoding protein targets of medicines (Walker et al., 2017). As a result, the cis-MR approach is being increasingly used to provide valuable evidence which can inform designs of expensive clinical trials (Gill et al., 2021). Several examples of cis-MR analyses that have provided insight on the potential efficacy of drug interventions are highlighted in Burgess et al. (2023).

A starting point for any MR analysis is to choose appropriate instruments. Since genome-wide association studies (GWASs) have been able to identify many strong genetic signals for a wide range of traits, the typical practice in polygenic MR is to select uncorrelated variants with strong measured associations with the exposure.

In contrast, the potential pool of instruments that we can consider for cis-MR is more limited in two aspects. First, the instruments are typically in highly structured correlation, owing to how genetic variants in the same region tend to be inherited together. Second, when the exposure of interest is a gene product, genetic associations are typically measured from much smaller sample sizes than usual GWASs, which would leave cis-MR analyses more vulnerable to problems of weak instrument bias (Andrews et al., 2019). Therefore, for our cis-MR focus, we will need to make use of many weak and correlated instruments.

One intuitive option is to filter out variants such that only a smaller set of uncorrelated or weakly correlated instruments remain. However, for cis-MR analyses that involve only weak genetic signals, it would seem important to represent the evidence suggested by many variants in the gene region. Another option is to select only those variants with a strong measured association with the exposure. While this might avoid problems relating to weak instruments, estimation could be more vulnerable to a winner's curse bias (Goering et al., 2002), resulting in poor inferences if the additional uncertainty from instrument selection is not accounted for (Mounier & Kutalik, 2023).

In this paper, we do not propose selecting specific genetic variants as instruments, but rather genetic factors. We consider a two-stage approach. In the first stage, we exploit the highly structured nature of genetic correlations in single gene regions to reduce the dimension of genetic variants. Following Bai & Ng (2002)'s approximate factor model, the variation in a large number of genetic variants is assumed to be explained by a finite number of latent factors. The estimated genetic factors are particular linear combinations of genetic variants which aim to capture all systematic variation in the gene region of interest. In the second stage, these estimated genetic factors are used as instrumental variables.

We focus on the problem of providing robust summary-data inference when the genetic factors are weak instruments. This is a concern not only due to the potentially smaller sample sizes involved in cis-MR analyses, but because the first-stage dimension reduction of genetic variants is based on their mutual correlation, and not on the strength of their association with the exposure. Thus, there is no guarantee that the estimated genetic factors would be strong instruments.

To provide valid inferences when estimated genetic factors are weak instruments, we consider two different approaches based on conditional testing. The first approach generalizes popular identification-robust tests (Moreira, 2003; Wang & Kang, 2022) for our setting with estimated genetic factors as instruments. Similar to Bai and Ng (2010)'s analysis under strong instruments, the asymptotic null distributions of the identification-robust test statistics can be established even when the true genetic factors are not identified.

One drawback with the identification-robust approaches is that they are unable to provide point estimates of the causal effect. In situations where a few instruments are considerably stronger than others, it is natural to question whether it might be better to discard those instruments which are almost irrelevant. In our case, if some of the estimated genetic factors appear to have very weak associations with the exposure, then we may consider dropping them, and then proceed with usual point estimation strategies. Therefore, in the second approach, we propose a test which appropriately adjusts for first-stage screening of genetic factors based on their relevance. The test controls the selective type I error: the error rate of a test of the causal effect given the selection of genetic factors as instruments (Fithian et al., 2017; Bi et al., 2020).

Our empirical motivation concerns a potential drug target of growing interest. Cholesteryl ester transfer protein (CETP) inhibitors are a class of drug that increase high-density lipoprotein cholesterol and decrease low-density lipoprotein cholesterol (LDL-C) concentrations. A recent cis-MR analysis by Schmidt et al. (2021) suggests that CETP inhibition may be an effective drug target for reducing coronary heart disease (CHD) risk.

A simulation study based on real CETP gene summary data illustrates how both factor-based conditional testing approaches offer reliable inferences under realistic problems faced in practice: weak instruments, invalid instruments, and mismeasured instrument correlations. Our application complements Schmidt et al. (2021)'s findings by providing robust evidence that the genetically-predicted LDL-C lowering effect of CETP inhibitors is associated with a lower risk of CHD.

We use the following notation and abbreviations: formula ‘converges in probability to’; formula ‘converges in distribution to’; formula ‘is asymptotically distributed as’. For any sequences formula and formula, if formula, then there exists a positive constant C and a positive integer N such that for all formula, formula and formula. If formula, then formula as formula. Also, if formula, then there exist positive constants C1 and C2, formula, and a positive integer N such that formula for all formula. Let formula denote the jth element of any vector A, and formula denote the formulath element of any matrix B. For any constant a, let formula denote its population variance. For any vector A, let formula denote its population variance–covariance matrix. Let formula denote the Euclidean norm of a vector. For any positive integer A, formula. The proofs of the theoretical results are given in Web Appendix C.

2. Approximate Factor Model and Summary Data

Let X denote the exposure, Y the outcome, and formula a mean-centered vector of p genetic variants which we assume are valid instrumental variables. For each variant formula, let formula denote the population covariance of variant j with the exposure, and formula denote the population covariance of variant j with the outcome. We are interested in estimating the causal effect of the exposure X on an outcome Y, which is denoted θ0 and is described by the linear model

1

where formula and formula. Although this specification does not explicitly allow for variants to have direct effects on the outcome that are not mediated by the exposure, we will later discuss how the approaches we consider are quite robust to this assumption.

2.1 Approximate Factor Model

Our asymptotic framework considers the setting where formula, since we aim to incorporate information from many genetic variants. For our cis-MR focus, we would expect a large number of genetic variants to have a block-like correlation structure (see, e.g., the variant correlations from the CETP gene in Figure 3 of Section 5). Therefore, we assume genetic variants in the region of interest follow an approximate factor model structure (Bai & Ng, 2002),

2

where formula is an unobserved formula matrix of factor loadings, f is an r-vector of unobserved factors, and formula is a p-vector of idiosyncratic errors. For each variant j, the component formula describes its systematic variation. Although p is large, r is considered to be finite; the systematic variation of p variants can be explained by a much smaller set of r latent factors. Thus, instead of using p genetic variants as instruments, we will aim to use the information of these r latent factors to construct instruments. We note that these latent factors are estimated for a specific gene region, and not for the whole genome; the latter are often used to adjust for the population structure.

Assumption 1. approximate factor model

(i) The unobserved factors and idiosyncratic errors are identically and independently distributed across individuals; (ii) the factor loadings, factors, and idiosyncratic errors are three mutually independent groups, and the idiosyncratic errors may have some dependence across variants; (iii) the idiosyncratic errors may have limited correlation with the sampling errors of genetic association estimates; (iv) the factors satisfy formula and formula is an formula positive-definite matrix; (v) for all variants j, formula for some constant formula, and formula, where formula is a positive-definite, non-random matrix, as formula.

Power results when testing the null hypothesis  for a 5% level test. The first row of panels correspond to Model 1 in Section 4.1. The second and third rows of panels correspond to Models 2 and 3 in Section 4.2. The fourth and last rows of panels correspond to Models 4 and 5 in Section 4.3.
Figure 1

Power results when testing the null hypothesis formula for a 5% level test. The first row of panels correspond to Model 1 in Section 4.1. The second and third rows of panels correspond to Models 2 and 3 in Section 4.2. The fourth and last rows of panels correspond to Models 4 and 5 in Section 4.3.

Root-mean squared error (RMSE) results under locally invalid instruments (Model 3). The histograms correspond to the standardized F-LIML estimates, and the solid black line is the N(0, 1) density curve.
Figure 2

Root-mean squared error (RMSE) results under locally invalid instruments (Model 3). The histograms correspond to the standardized F-LIML estimates, and the solid black line is the N(0, 1) density curve.

Genetic variant correlations (left), 368 genetic associations with LDL-C and CHD (center), and 14 estimated factor associations with LDL-C and CHD (right), in the CETP gene region. CHD, coronary heart disease; LDL-C, low-density lipoprotein cholesterol.
Figure 3

Genetic variant correlations (left), 368 genetic associations with LDL-C and CHD (center), and 14 estimated factor associations with LDL-C and CHD (right), in the CETP gene region. CHD, coronary heart disease; LDL-C, low-density lipoprotein cholesterol.

Assumption 1 implies Assumptions A–F from Bai (2003, pp. 141–144); the assumptions imply that r strong factors exist, and that, as formula, there is a significant difference between the rth and formulath eigenvalues of the variance–covariance matrix of Z, formula. Throughout our analysis, r is considered to be fixed and known. In Section 4.4, we discuss the potential issues with misspecifying the true number of factors.

Compared with classical factor models, the assumptions maintained in an approximate factor model are weak enough to prevent separate identification of factors and factor loadings, however both can be estimated up to an formula rotation matrix. This should involve no loss of information since in terms of retaining the same explanatory power, we require only that the estimated factors span the same space as the true factors (Bai & Ng, 2002). For details on the dependence permitted between idiosyncratic errors across variants, see Bai and Ng (2010, Assumption A(c), p. 1581). We take the assumptions of the factor model at face value, but in practice we recommend verifying that there is significant structure in the variance–covariance matrix of genetic variants, and that the estimated factors explain a large proportion of genetic variation. Further discussion of Assumption 1(iii) is provided in Web Appendix B.2.

2.2 Two-sample Summary Data

We work within the popular two-sample summary data design, where estimated variant–exposure associations formula are obtained from a non-overlapping, but representative, sample from estimated variant–outcome associations formula. Let formula denote the size of the sample used to compute variant–exposure associations, and formula denote the sample size used to compute variant–outcome associations. We assume that formula for some positive constant formula. This ensures that for our two-sample setting, the sampling uncertainty from one association study is not negligible to the other.

For any two variants j and k, we also assume knowledge of the population genetic correlation formula. Genetic correlation estimates can be obtained from popular MR software packages (see, e.g., Hemani et al., 2018). Using a similar approach to Wang & Kang (2022, Theorem 2), we can combine these estimates with GWAS summary data which allow us to construct an estimate of the variance–covariance matrix of genetic variants, formula.

Assumption 2. summary data on genetic associations

The estimated genetic associations satisfy

where formula, formula, and formula.

Assumption 2 states that the estimated genetic variant–trait covariances are normally distributed around their population counterparts, and that their standard errors are decreasing at the usual parametric rate. This is usually justified by large random sampling in GWASs (Zhao et al., 2020, Assumption 1). Assumption 2 also describes a setting of conditional homoscedastic errors in a linear model, where conditional trait variances given genetic factors are constant. In general, a similar assumption is often maintained for conducting joint analyses from marginal summary genetic data (Yang et al., 2012).

2.3 Weak Genetic Associations

Our asymptotic analysis relies on the assumption that many genetic variants have weak associations with the exposure; see, for example, Zhao et al. (2020).

Assumption 3. many weak genetic associations

formula as formula, where formula.

Assumption 3 implies that the average explanatory power of any individual variant is decreasing with the total number of variants. Since we do not directly use individual variants as instruments, this does not necessarily imply we face a weak instruments problem. We will take r linear combinations of all variants to use as instruments; these linear combinations will correspond to taking the space spanned by the factor loadings formula, which we can consistently estimate under the rate restriction formula, as formula. As discussed by Bai and Ng (2010), it is possible for these linear combinations to be strong instruments even if the explanatory power of individual variants is limited.

However, for our focus, we deem this to be quite unrealistic: our dimension reduction of genetic variants will be based on their covariance structure, not their association with the exposure. Hence, we could end up using instruments that are able to summarize nearly all genetic variation in a gene region, but they are still weakly associated with the exposure. For this reason, we should focus on inferential methods that are robust to weak instruments.

3. Conditional Inference in cis-MR with Genetic Factors

3.1 Estimating the Factor Loadings

We start by estimating the formula matrix of factor loadings formula using the estimated variance–covariance matrix of Z, formula. For a given number of factors r, let formula denote a formula matrix with its columns given by the eigenvectors corresponding to the largest r eigenvalues of formula multiplied by formula. Then, the estimated re-scaled factor loadings are given by formula, so that formula.

For our analysis, the number of factors r is assumed to be known. In practice, we may decide on the number of factors by inspecting the scree plot of formula, or by using data-driven methods as in Onatski (2010).

3.2 Point Estimation under Strong Factor Associations

Under the linear model (1), we can use our variant–exposure and variant–outcome associations to construct a vector of estimating equations, formula. Here, there are p estimating equations for 1 unknown θ. Given our estimated factor loadings, we can effectively reduce the degree of over-identification.

Let formula, so that formula provides r estimating equations for θ0. In other words, formula are the estimating equations implied by using the linear combination of variants formula as instruments. For brevity, we will refer to formula as the estimated factors.

First, we construct consistent estimators formula for formula, and formula for formula, where formula. Then, we can construct a consistent estimator formula for formula. A limited information maximum likelihood (LIML; Anderson and Rubin, 1949) estimator is given by

We call formula the F-LIML estimator; the LIML estimator which uses the entire vector of estimated factors as instruments.

Theorem 1. Under Assumptions 1–3 and (Equations 1) and (2), if formula, then formula as formula, where formula, formula, and formula.

The condition formula means that the genetic factors are collectively strong instruments, and it requires that many variants have weak effects on the exposure. Theorem 1 can be used directly to construct asymptotic confidence intervals and tests for the causal effect θ0.

3.3 Identification-robust Tests under Weak Factor Associations

Standard t-tests based on Theorem 1 will not be valid when the estimated factors are weak instruments. This is because under weak instrument asymptotics the distribution of t-tests will depend on a measure of instrument strength (Stock et al., 2002). Instead, identification-robust tests offer a way to make valid inferences in this setting. The basic idea behind this approach is to construct pivotal test statistics conditional on a sufficient statistic for instrument strength. Then, since the conditional distributions of these test statistics do not depend on instrument strength under the null hypothesis, the size of the tests can be controlled under weak instruments.

We can follow previous works by constructing these test statistics as a function of two asymptotically mutually independent statistics formula and formula, where formula carries the information of the estimated factors being valid instruments, and where formula incorporates information on the strength of these instruments. Specifically, under the null hypothesis formula, let formula, and formula, where formula, formula, and where formula, formula, and formula are defined in Section 3.2.

Using formula and formula, we can construct three commonly-used identification-robust test statistics which will be asymptotically pivotal conditional on formula, where formula, formula, formula, formula, and H is a rotation matrix. Let formula, formula, and formula. Then, the Anderson and Rubin (1949) statistic with estimated factors, denoted F-AR, is given by formula, Kleibergen (2005)'s Lagrange multiplier statistic with estimated factors, denoted F-LM, is given by formula, and Moreira (2003)'s conditional likelihood ratio statistic with estimated factors, denoted F-CLR, is given by formula.

Theorem 2. Suppose that formula and formula, formula, for some formula. Under Assumptions 1–3, (Equations 1) and (2), and formula, conditional on formula, (i) formula; (ii) formula; and (iii) formula as formula, where formula and formula denote independent chi-square random variables.

The rate restriction formula describes the setting where all genetic factors are weak instruments. The condition allows for all variant associations with the exposure to be collectively weak formula as long as they are not too uneven. Since the F-AR statistic is not a function of formula, it does not incorporate the identifying power of instruments. As a result, when the model is over-identified (formula), the F-AR test may have relatively poor power properties compared with the F-LM and F-CLR tests (Andrews et al., 2019). Of the three methods, CLR-based tests are widely regarded as the most powerful, due to simulation evidence and favorable theoretical properties (Andrews et al., 2006). To implement the CLR test, we use the algorithm for computing its asymptotic conditional p-value derived by Andrews et al. (2007).

3.4 Conditional Tests That Adjust for Factor Selection

While identification-robust approaches are designed to control type I error rates for any level of instrument strength, they do not provide point estimates. In a sparse effects setting where a few estimated factors are strong instruments and most other estimated factors are very weak instruments, it would be tempting to proceed with F-LIML point estimation after removing very weak instruments. The approaches described in Sections 3.2 and 3.3 use the entire r-vector of estimated factors as instruments. In contrast, here we wish to filter out certain elements if they are demonstrably weak instruments.

To this end, we construct pre-tests to identify a subset of estimated factors that pass a threshold of relevance; only this subset is then used as instruments. By Bai (2003), formula estimates formula where H is a rotation matrix. Hence, the estimated factor associations formula actually estimate formula, and not formula as we may intuitively expect. Therefore, for each formula, we will test the null hypothesis formula against the alternative formula, where formula.

Simple t-tests are used to screen for relevant estimated factors, using the asymptotic approximation formula, formula. In particular, to conduct a two-sided asymptotic υ-level test for the significance of each estimated factor formula, we compare the test statistic formula, where formula, against the critical value formula, where Φ(.) is the standard normal cumulative distribution function. For each formula, if formula, then we have evidence to reject formula, and thus we include the estimated factor j as an instrument. Any estimated factors such that formula are deemed to be weak instruments, and are thus discarded.

Let formula denote the selection event according to these pre-tests. For example, if formula and only the first and third estimated factors pass the pre-test of relevance, then formula. Only using the subset of estimated factors that have passed the pre-test of relevance, we will test the null hypothesis formula against the general alternative formula. To appropriately account for pre-testing of relevant estimated factors, we seek to construct a conditional test which controls the selective type I error (Fithian et al., 2017). That is, for an α-level test, formula under formula; that is, we control the error rate of the test at αgiven the selection event formula.

Suppose formula is the number of selected estimated factors, and let formula denote the indices of the selected set of estimated factors, so that the selection event is formula. We also let formula denote an formula selection matrix which is constructed such that formula is the formula matrix, where its formula columns are the columns of formula that correspond to the selected factors formula. We call the resulting LIML estimator which uses only the selected factors ‘Selected LIML’ (S-LIML), which is denoted as formula.

We are often interested in testing the null hypothesis formula for non-zero values formula; for example, this is useful for obtaining confidence intervals by test inversion. In this case, we would need to consider how the distribution of formula is impacted by the uncertainty of the pre-test results formula (see Web Appendix D.1 for further discussion). Let formula, formula, and formula. We base our inferences on the following joint normality approximation which partially describes the dependency of formula on the vector of pre-test statistics formula,

3

where formula, formula, and D is the diagonal matrix with its formulath element given by formula. We take this approximation as given, but it is not exact because we do not account for the estimation error term formula, for formula, which may not be negligible under the setting where the estimated factors are strong instruments. However, in our simulation experiments, this approximation performs reasonably well under weak and strong instrument settings. (Equation 3) suggests that the conditional distribution of formula given formula depends on an r-dimensional nuisance parameter formula. Thus, along with the selection event formula, we will condition on a sufficient statistic for the nuisance parameter, which will cause it to drop from the conditional distribution of formula (see, e.g., Sampson & Sill, 2005). According to (3), a sufficient statistic for formula is given by formula.

Theorem 3. Under (Equation 3) and formula, the conditional distribution of formula given formula and formula is approximately

4

where formula, and formula.

Intuitively, this conditional distribution of formula reveals what the likely values of formulashould be under formula, given the results of the pre-tests formula and observed value formula. If the S-LIML estimate formula does not lie in a suitable likely region, then we interpret this as evidence against formula.

We can construct estimators formula of CG, formula of formula, where formula, formula, and formula is the diagonal matrix with its formulath element given by formula. By conditioning on formula, we can conduct an approximate α-level test for formula by using the sample analog of the right-hand side of (Equation 4), taking repeated draws of formula, and computing formula and formula-level quantiles of the approximated formula distribution under formula. If the S-LIML estimate formula does not lie within those quantiles, then we reject the null hypothesis formula.

4. Simulation Study

This section presents the performance of the identification-robust and conditional test statistics in a simulation study based on real genetic data. The simulation design aims to explore the robustness of our empirical results in Section 5, where we investigate CETP inhibitors as a potential drug target for CHD. CETP are a class of drug which increase high-density lipoprotein cholesterol and decrease LDL-C. Our exposure X is LDL-C, our outcome Y is CHD, and to construct instruments we used formula genetic variants from a neighborhood of the CETP gene which are associated with LDL-C at p-value less than formula.

The true factor loadings were set as the factor loadings corresponding to the measured variance–covariance matrix of genetic variants in the CETP region, and the true variant–trait associations formula and formula were taken as the measured variant associations with LDL-C and CHD. We generated formula and formula according to Assumption 2 and (Equation 1). The unconditional trait variances were set equal to 1, and the sample sizes for both association studies were set equal to n, where n was chosen to vary the instrument strength of genetic factors according to a particular value of the F-statistic. Unless otherwise stated, the F-statistic was set equal to 20, and the true number of factors was set to formula.

The variant correlation matrix formula was held fixed and equal to the measured correlation matrix from the CETP gene. For our proposed methods, we generated sampling errors in estimation of the variance–covariance matrix formula. In particular, for each variant j, we only observed formula which was generated as a truncated normal with mean formula, variance formula, bounded by the minimum and maximum measured formula over all variants formula.

Our proposed methods are based on dimension reduction of all variants rather than selecting specific variants. Instead of using estimated factors as instruments, we might wonder if it is better to omit highly correlated variants and use existing approaches which account for a smaller number of moderately or nearly uncorrelated correlated variants as instruments. Thus, for comparison we note the results of Wang & Kang (2022)'s CLR test, where variants are filtered out if they are correlated with an already included variant at some pre-specified threshold. In MR terminology, this is called pruning. Another way to incorporate moderately correlated variants is to de-correlate summary data associations using knowledge of the variant–correlation matrix ρ, and then apply simpler MR estimation strategies such as inverse variance weighted (IVW; Burgess et al., 2013). DecorrIVW notes the results of a de-correlated IVW method using a pruning threshold of formula.

Finally, we also consider approaches which assume uncorrelated variants, but are robust to many weak instruments and outlier effects, such as robust adjusted profile Score (RAPS; Zhao et al., 2020), De-biased IVW (DIVW; Ye et al., 2021), and a radial IVW regression approach with second-order weights (RAD, Bowden et al., 2019). For indicating the level of pruning used, the tests CLR-0, CLR-1, CLR-2, and CLR-4 will denote the CLR test pruned to near-independence after selecting the strongest associated variant with LDL-C (formula), formula, formula, and formula, respectively. Likewise, RAPS-0, RAPS-1, and other tests are defined analogously.

Our simulation study focuses on studying the performance of tests under three practical problems of interest in cis-MR analyses: weak instruments, invalid instruments, and mismeasured instrument correlations. The full results from all methods are given in Web Appendix D.

4.1 Weak Instruments

Since proteins are the drug target of most medicines, cis-MR analyses often use proteins as the exposure of interest. Genetic associations with protein or gene expression are typically measured with smaller sample sizes than usual GWASs. In practice, this can result in a weak instruments problem. In Model 1, we explore the performance of tests under correct model specification and varying instrument strength, as measured by the F-statistic of the association between the true genetic factors formula and X.

The first row in Figure 1 shows that when all variants are valid instruments, using moderately correlated individual variants as instruments performs well; the CLR-1 and DecorrIVW tests are considerably more powerful than the factor-based tests even under a pruning threshold of formula. Using estimated factors as instruments provide reliable inference for both F-CLR and S-LIML approaches, with both tests offering slightly more power than CLR-0 under strong instruments. The S-LIML test screened the estimated factors for relevance using formula level pre-tests for formula, formula for formula, and formula for formula. With these thresholds, the median number of factors retained were formula under formula, and formula under formula. The first row in Figure 1 further suggests that under weak factor instruments formula, F-LIML which uses all 11 estimated factors does not control type I error rates. Although LIML estimators may provide consistent estimation under many weak instruments (Chao & Swanson, 2005), for inference, the conventional standard errors are too small (Newey & Windmeijer, 2009).

4.2 Invalid Instruments

Although the linear model (1) is commonly used in practice, we may be concerned that proportionality of genetic associations formula may not hold exactly over all variants formula. Fortunately, the factor model approach may provide some robustness to the inclusion of invalid instruments. For example, under Bai and Ng (2010)'s analysis, a finite number of variants would be permitted to have direct effects on the outcome as long as the total number of variants p grows sufficiently faster than n.

Here, we study finite-sample behavior under local model misspecification formula, where the direct effects formula are fixed. Under this setting, the squared bias and variance terms of the F-LIML estimator that comprise its mean squared error are of the same order of magnitude asymptotically. In Model 2, these direct effects are equal for all variants, and the results are displayed in the second row of Figure 1. For formula, the CLR-1 and DecorrIVW tests which use moderately correlated individual variants as instruments offer competitive size properties along with the factor-based approaches. However, for more severe problems of directional pleiotropy (formula), F-CLR and S-LIML are better able to control type I error rates.

In Model 3, the direct effects are generated as formula, where each element of formula is drawn from the uniform distribution formula, and we vary instrument strength with the F-statistic. This is similar to an assumption of balanced pleiotropy often maintained in polygenic MR (Hemani et al., 2018). When the direct effects formula are random around zero, their impact is to inflate the variance of the resulting estimate. In contrast, here we set the direct effects formula to be fixed, so that they directly impact the bias of the resulting estimates, with no de-biasing adjustment possible without imposing further restrictions. The results from Model 3 are displayed in the third row of Figure 1, and they further highlight the robust performance of F-CLR and S-LIML under invalid instruments, with relatively small-size distortions regardless of instrument strength.

4.3 Mismeasured Variant Correlations

In Models 4 and 5, our simulation designs investigate robustness to a very common problem in cis-MR analyses. It is often the case that the variant correlation matrix ρ is not provided alongside summary genetic association data. In such situations, if researchers want to make use of correlated variants, they would need to obtain estimates of the variant correlation matrix from a reference sample containing a different set of subjects.

Discrepancies between the variant correlation matrix from the reference sample formula, and the true variant correlation matrix for the two-sample summary data formula may arise due to at least two reasons. First, the size of the reference sample may be significantly lower than the sample size of GWASs, thus allowing more room for sampling errors. Second, while all samples are assumed to be drawn from the same population, in practice the two correlation estimates may be based on heterogeneous samples.

To study the problem of mismeasured variant correlations, in Model 4, for any two variants formula, we assume the variant correlation estimate available to the researcher satisfies formula, where formula are the true variant correlations used to construct the two-sample summary associations, and κ0 is a fixed constant. Here, we set where formula since the minimum observed value in the variant correlation matrix was −0.839, so that formula still returns feasible correlation values. The fourth row of Figure 1 displays the results for Model 4, where we note that CLR-0 which erroneously assumes variants are nearly uncorrelated has inflated type I error rates. It was not possible to note the performance of DecorrIVW due to numerical instabilities, which suggests that an approach based on de-correlating genetic associations may be more sensitive to mismeasured variant correlations compared with methods that are designed to account for the variant correlation structure.

The design in Model 5 aims to shed light on the potential problems with misspecifying the correlation structure of formula, which is of practical concern if we suspect that genetic correlations are measured from a population which is not representative of participants from which genetic–trait associations are obtained. We assume that the measured variant correlation matrix available to the researcher is formula, for all formula, and formula. If formula, moderate variant correlations are pushed toward zero, which may result in a fewer number of factors explaining a large proportion of genetic variation. Under Model 5, we selected the number of factors based on identifying an eigenvalue gap in the scree plot. Instead of formula, we used five estimated factors under formula, and four estimated factors under for formula. The results are displayed in final row of Figure 1. Under formula, F-CLR, S-LIML, and F-LIML have very poor power properties, partially owing to an under-selection of estimated factors. For Model 5, CLR-0 appears to achieve a good balance between type I error control and power. Intuitively, we would prefer filtering to near-independence rather than use correlated variants based on a largely misspecified correlation structure.

4.4 Misspecifying the Number of Factors

In this section, we discuss the potential problems with misspecifying the true number of factors r, which are assumed to be known throughout our analysis. Here, we study the same local misspecification design as Model 3 discussed in Section 4.2, and vary the assumed number of factors formula away from formula. The histograms from Figure 2 show the standardized F-LIML estimates formula which should be asymptotically distributed as N(0, 1) according to Theorem 1. The results show that under-selecting the number of factors (formula) still returns median-unbiased estimates, whereas using a higher number of factors appears to introduce some bias. For higher values of formula, we run the risk of additional factors prioritizing the information of variants which are in less structured correlation. Under Model 3, many such variants could have strong direct effects on the outcome which may cause estimates to become biased.

Interestingly, over-selecting the number of factors still results in improved estimation in terms of root-mean squared error (RMSE); allowing a small amount of bias is a price worth paying for significant gains in variance reduction. Compared with formula, the RMSE of F-LIML is halved when selecting formula factors under very weak instruments (formula). Especially under weak instrument settings formula, S-LIML outperforms F-LIML in terms of RMSE, which further underscores the importance of filtering out weak estimated factors. When formula, S-LIML appears to be the best choice for point estimation compared to alternative robust methods. More generally, the DecorrIVW and DIVW methods under strong pruning formula also provide precise estimates. We also find that over-estimating the true number of factors may lead to only modest increases in type I error rates for the F-CLR and S-LIML methods under Model 5 (see Figure S.9 in Web Appendix D). In practice, we suggest that it may be preferable to lean toward over-selecting the number of factors rather than under-selecting based on MSE considerations.

5. Empirical Application: Cholesteryl Ester Transfer Protein Inhibition and Coronary Heart Disease

CETP inhibitors are a class of drug which increase high-density lipoprotein cholesterol and decrease LDL-C concentrations. At least three CETP inhibitors have failed to provide sufficient evidence of a protective effect on CHD in clinical trials, before the successful trial of Anacetrapib showed marginal benefits alongside statin therapy (Bowman et al., 2017). However, with further trials currently ongoing, cis-MR analyses can offer important supporting evidence to complement experimental results. For example, in recent work, Schmidt et al. (2021)'s cis-MR analysis suggests that CETP inhibition may be an effective drug target for CHD prevention.

From a statistical perspective, we may have a few concerns regarding the criteria used by Schmidt et al. (2021) to select instruments. First, to guard against weak instrument bias, they select variants based on an in-sample measure of instrument strength (F-statistic > 15), which could potentially leave the analysis vulnerable to a winner's curse bias (Mounier & Kutalik, 2023). Second, to guard against heterogeneity of genetic associations, they use a measure of instrument validity to remove outliers (Cochran's Q statistic; see, e.g., Bowden et al., 2019), which can result in size-distorted tests (Guggenberger & Kumar, 2012). Finally, for those correlated variants with strong measured associations with the outcome, they allow variants up to a pruning threshold of formula; our simulation results in Web Appendix D show that inference can be sensitive to the choice of a pruning threshold.

Here, we apply conditional inference techniques to investigate the genetically-predicted LDL-C lowering effect of CETP inhibition on the risk of CHD. Genetic associations with LDL-C were taken from a GWAS of 361,194 individuals of white-British genetic ancestry in the UK Biobank and were in standard deviation units (Sudlow et al., 2015). Genetic associations with CHD, measured in log odds ratio units, were taken from a meta-GWAS of 48 studies with a total of 60,801 cases and 123,504 controls from a majority European population, conducted by the CARDIoGRAMplusC4D consortium (Nikpay et al., 2015). Genetic variant correlations were obtained from a reference panel of European individuals (1000 Genomes Project, Auton et al., 2015) using the twosampleMR R package (Hemani et al., 2018).

Since our method assumes a linear model, we transformed the measured variant–CHD associations that were obtained under a univariable logit regression to approximate estimated coefficients from a univariable linear regression. Further details on this transformation are provided in Web Appendix A.

A total of 368 genetic variants were drawn from the CETP region, with variant positions within ±100 kb from the CETP gene position indicated on GeneCards (Stelzer et al., 2016). The variant correlation matrix was highly structured, with formula factors explaining nearly 99% of the total variation of the 368 genetic variants. Noting the gap between the 14th and 15th eigenvalue, we selected formula estimated factors as instruments for the F-AR, F-LM, F-CLR, and S-LIML methods.

Only 6 of the 14 estimated factors were retained by the S-LIML method after pre-testing for relevant factors at the formula level. Table 1 shows that the S-LIML method gives a point estimate of formula for a unit increase in the risk of CHD associated with a 1 standard deviation change in LDL-C, with a corresponding 95% confidence interval [0.051,0.215]. The results were reasonably robust to the choice of pre-testing threshold υ used to select relevant estimated factors, however the S-LIML estimate becomes less precise for the stricter threshold formula which may be due to the exclusion of a relevant factor.

TABLE 1

Est. gives the point estimate of the method if applicable.

formulaformulaformulaformulaF-ARF-LMF-CLR
Est.0.1310.1320.1330.129---
CI-L0.0520.0520.0510.045-0.0500.0520.051
CI-U0.2100.2110.2150.2200.3340.2130.215
Q-stat.0.9970.9970.9640.9450.998*0.999*0.999*
formulaformulaformulaformulaF-ARF-LMF-CLR
Est.0.1310.1320.1330.129---
CI-L0.0520.0520.0510.045-0.0500.0520.051
CI-U0.2100.2110.2150.2200.3340.2130.215
Q-stat.0.9970.9970.9640.9450.998*0.999*0.999*

Note: CI-L and CI-U are the lower and upper estimated 95% confidence intervals. The brackets after S-LIML indicate the threshold level υ taken. formula gives the p-value associated with testing the null of no heterogeneity in instrument–LDL-C and instrument–CHD associations using the Sargan–Hansen test; see the discussion in Section 5.

TABLE 1

Est. gives the point estimate of the method if applicable.

formulaformulaformulaformulaF-ARF-LMF-CLR
Est.0.1310.1320.1330.129---
CI-L0.0520.0520.0510.045-0.0500.0520.051
CI-U0.2100.2110.2150.2200.3340.2130.215
Q-stat.0.9970.9970.9640.9450.998*0.999*0.999*
formulaformulaformulaformulaF-ARF-LMF-CLR
Est.0.1310.1320.1330.129---
CI-L0.0520.0520.0510.045-0.0500.0520.051
CI-U0.2100.2110.2150.2200.3340.2130.215
Q-stat.0.9970.9970.9640.9450.998*0.999*0.999*

Note: CI-L and CI-U are the lower and upper estimated 95% confidence intervals. The brackets after S-LIML indicate the threshold level υ taken. formula gives the p-value associated with testing the null of no heterogeneity in instrument–LDL-C and instrument–CHD associations using the Sargan–Hansen test; see the discussion in Section 5.

In Table 1, the confidence intervals for S-LIML and identification-robust methods are obtained by test inversion. The 95% asymptotic confidence intervals for the F-CLR and F-LM tests are similar to the S-LIML intervals, while the F-AR approach is much less precise and is unable to reject the null hypothesis of no causal association (F-AR p-value: 0.455).

The results of alternative summary data methods are presented in Table 2. We find that DecorrIVW and CLR with correlated variants are quite sensitive to the pruning threshold chosen, with the 95% asymptotic confidence interval of DecorrIVW-1 not overlapping with the DecorrIVW-2 interval.

Our simulation study illustrated how our factor-based approaches were relatively robust to biases from direct variant effects on the outcome. The heterogeneity plots in Figure 3 can provide insight on the coherency of evidence across multiple instruments. A more formal method to test for excessive heterogeneity uses the Sargan–Hansen (1982) test. Table 1 shows that the F-LIML and S-LIML approaches provide strong evidence of no heterogeneity when using estimated factors as instruments. Since identification-robust methods do not provide point estimates, for the starred entries in the last rows of Tables 1 and 2, we evaluated the Sargan–Hansen test statistic (1982) at the mid-point of the relevant confidence interval. There was no ‘degrees of freedom’ correction for this substitution which should lead to more conservative p-values (i.e., we are less likely to reject the null of no heterogeneity). Despite this, more liberal pruning-based approaches show evidence of greater heterogeneity when considering individual variants as instruments.

TABLE 2

Est. gives the point estimate of the method if applicable.

CLR-0CLR-1CLR-2CLR-4RAPS-0DIVW-0DecorrIVW-1DecorrIVW-2
Est.0.1220.1200.0680.117
CI-L0.0560.0210.0880.1070.0810.0800.0430.101
CI-U0.2180.1200.1530.1470.1630.1600.0920.133
Q-stat.0.282*0.610*0.391*0.012*0.2520.2520.4130.000
CLR-0CLR-1CLR-2CLR-4RAPS-0DIVW-0DecorrIVW-1DecorrIVW-2
Est.0.1220.1200.0680.117
CI-L0.0560.0210.0880.1070.0810.0800.0430.101
CI-U0.2180.1200.1530.1470.1630.1600.0920.133
Q-stat.0.282*0.610*0.391*0.012*0.2520.2520.4130.000

Note: CI-L and CI-U are the lower and upper estimated 95% confidence intervals. formula gives the p-value associated with testing the null of no heterogeneity in instrument–LDL-C and instrument–CHD associations using the Sargan–Hansen test; see the discussion in Section 5.

TABLE 2

Est. gives the point estimate of the method if applicable.

CLR-0CLR-1CLR-2CLR-4RAPS-0DIVW-0DecorrIVW-1DecorrIVW-2
Est.0.1220.1200.0680.117
CI-L0.0560.0210.0880.1070.0810.0800.0430.101
CI-U0.2180.1200.1530.1470.1630.1600.0920.133
Q-stat.0.282*0.610*0.391*0.012*0.2520.2520.4130.000
CLR-0CLR-1CLR-2CLR-4RAPS-0DIVW-0DecorrIVW-1DecorrIVW-2
Est.0.1220.1200.0680.117
CI-L0.0560.0210.0880.1070.0810.0800.0430.101
CI-U0.2180.1200.1530.1470.1630.1600.0920.133
Q-stat.0.282*0.610*0.391*0.012*0.2520.2520.4130.000

Note: CI-L and CI-U are the lower and upper estimated 95% confidence intervals. formula gives the p-value associated with testing the null of no heterogeneity in instrument–LDL-C and instrument–CHD associations using the Sargan–Hansen test; see the discussion in Section 5.

Overall, our findings provide robust evidence that genetically-predicted lower LDL-C levels using variants in the CETP gene region are associated with a lower risk of CHD.

6. Conclusion

There is an increasing focus on using cis-MR analyses to guide drug development; genetic evidence may be crucial to support novel targets, precision medicine subgroups, and the design of expensive clinical trials (Gill et al., 2021). The use of a few uncorrelated variants as instruments may lead to inferences which are vulnerable to direct variant effects on the outcome. On the other hand, using correlated variants as instruments may result in unstable inferences which are particularly sensitive to common problems of misspecified variant correlations. We believe that our factor-based approach provides a robust and practical way to assess the general weight of evidence for a causal association from the gene region of interest. Given its reliable performance in simulation, we recommend the use of the F-CLR test alongside existing robust methods in cis-MR investigations, especially in settings where there are multiple genetic signals for the exposure in the gene region.

A limitation of our approach is that we require the knowledge of a large genetic correlation matrix of the gene region, and systematic misspecification of these correlations (e.g., due to structural differences in the two sampled populations) may lead to biased inferences. At the same time, in settings where all variants are valid instruments, methods which use individual variants as instruments, rather than genetic factors, tend to provide a more powerful analysis, especially in cases where there are only a few strong genetic signals for the exposure. Furthermore, when the quality of available genetic correlation estimates is in doubt or there is only one strong genetic signal, it may be sensible to compare results from methods using correlated variants to those of the Wald ratio estimator based on the variant most strongly associated with the exposure. Finally, our approach assumes that the true number of factors is known. We leave for future the work the problem of formalizing a potential bias-variance trade off associated with the use of additional factors as instruments, and developing a procedure that determines an optimal selection.

Data Availability Statement

The data that support the findings in this paper are openly available. Summary data on coronary artery disease from CARDIoGRAMplusC4D investigators were accessed at http://www.CARDIOGRAMPLUSC4D.ORG. Summary data on low-density lipoprotein cholesterol concentrations from Neale Lab's analysis of UK Biobank data were accessed at http://www.nealelab.is/uk-biobank/. Data on genetic variant correlations from The 1000 Genomes Project Consortium (1000 Genomes Project, Auton et al., 2015) were accessed using the twosampleMR R software package (Hemani et al., 2018).

Acknowledgments

Ashish Patel and Paul Newcombe were funded by the UK Medical Research Council (programme number MC-UU-00002-9). Paul Newcombe also acknowledges support from the NIHR Cambridge BRC. Stephen Burgess was supported by Sir Henry Dale Fellowship jointly funded by the Wellcome Trust and the Royal Society (grant number 204623-Z-16-Z). Dipender Gill was funded by the Wellcome 4i Program at Imperial College London (award number 203928-Z-16-Z), the British Heart Foundation Centre of Research Excellence (RE-18-4-34215) at Imperial College London, and a National Institute for Health Research Clinical Lectureship at St George' s, University of London (CL-2020-16-001). We thank Jack Bowden, Francis DiTraglia, Apostolos Gkatzionis, Alexei Onatski, Richard Smith, and Chris Wallace for helpful discussions. We thank two anonymous referees and the associate editor for detailed comments.

References

Anderson
,
T.W.
&
Rubin
,
H.
(
1949
)
Estimation of the parameters of a single equation in a complete system of stochastic equations
.
The Annals of Mathematical Statistics
,
20
,
46
63
.

Andrews
,
D.W.
,
Moreira
,
M.J.
&
Stock
,
J.H.
(
2006
)
Optimal two-sided invariant similar tests for instrumental variables regression
.
Econometrica
,
74
,
715
752
.

Andrews
,
D.W.
,
Moreira
,
M.J.
&
Stock
,
J.H.
(
2007
)
Performance of conditional Wald tests in IV regression with weak instruments
.
Journal of Econometrics
,
139
,
116
132
.

Andrews
,
I.
,
Stock
,
J.H.
&
Sun
,
L.
(
2019
)
Weak instruments in instrumental variables regression: theory and practice
.
Annual Review of Economics
,
11
,
727
753
.

Auton
,
A.
,
Abecasis
,
G.R.
,
Altshuler
,
D.M.
,
Durbin
,
R.M.
,
Bentley
,
D.R.
&
McEwen
,
J.E.
(
2015
)
A global reference for human genetic variation
.
Nature
,
526
,
68
74
.

Bai
,
J.
(
2003
)
Inferential theory for factor models of large dimensions
.
Econometrica
,
71
,
135
171
.

Bai
,
J.
&
Ng
,
S.
(
2002
)
Determining the number of factors in approximate factor models
.
Econometrica
,
70
,
191
221
.

Bai
,
J.
&
Ng
,
S.
(
2010
)
Instrumental variable estimation in a data rich environment
.
Econometric Theory
,
26
,
1577
1606
.

Bi, N., Kang, H. & Taylor, J. (2020) Inferring treatment effects after testing instrument strength in linear models. arXiv:2003.06723 (pre-print),

1
24
.

Bowden
,
J.
,
Del Greco M
,
F.
,
Minelli
,
C.
,
Zhao
,
Q.
,
Lawlor
,
D.A.
&
Sheehan
,
N.A.
(
2019
)
Improving the accuracy of two-sample summary-data Mendelian randomization: moving beyond the NOME assumption
.
International Journal of Epidemiology
,
48
,
728
742
.

Bowman
,
L.
,
Hopewell
,
J.C.
,
Chen
,
F.
,
Wallendszus
,
K.
,
Stevens
,
W.
,
Collins
,
R.T.
&
HPS3 and TIMI55 REVEAL Collaborative Group
. (
2017
)
Effects of anacetrapib in patients with atherosclerotic vascular disease
.
New England Journal of Medicine
,
377
,
1217
1227
.

Burgess
,
S.
,
Butterworth
,
A.
&
Thompson
,
S.G.
(
2013
)
Mendelian randomization analysis with multiple genetic variants using summarized data
.
Genetic Epidemiology
,
37
,
658
665
.

Burgess
,
S.
,
Mason
,
A.M.
,
Grant
,
A.
,
Slob
,
E. A.W.
,
Gkatzionis
,
A.
&
Gill
,
D.
(
2023
)
Using genetic association data to guide drug discovery and development: review of methods and applications
.
American Journal of Human Genetics
,
110
,
195
214
.

Chao
,
J.C.
&
Swanson
,
N.R.
(
2005
)
Consistent estimation with a large number of weak instruments
.
Econometrica
,
73
,
1673
1692
.

Fithian, W., Sun, D.L. & Taylor, J. (2017) Optimal inference after model selection. arXiv:1410.2597 (pre-print),

1
39
.

Gill
,
D.
,
Georgakis
,
M.K.
,
Walker
,
V.M.
,
Schmidt
,
A.F.
,
Gkatzionis
,
A.
&
Davies
,
N.M.
(
2021
)
Mendelian randomization for studying the effects of perturbing drug targets
.
Wellcome Open Research
,
6
,
1
19
.

Goering
,
H.H.
,
Terwilliger
,
J.D.
&
Blangero
,
J.
(
2002
)
Large upward bias in estimation of locus-specific effects from genome-wide scans
.
The American Journal of Human Genetics
,
69
,
1357
1369
.

Guggenberger
,
P.
&
Kumar
,
G.
(
2012
)
On the size distortion of tests after an overidentifying restrictions pretest
.
Journal of Applied Econometrics
,
27
,
1138
1160
.

Hansen
,
L.P.
(
1982
)
Large sample properties of generalized method of moments estimators
.
Econometrica
,
50
,
1029
1054
.

Hemani
,
G.
,
Bowden
,
J.
&
Davey Smith
,
G.
(
2018
)
Evaluating the potential role of pleiotropy in Mendelian randomization studies
.
Human Molecular Genetics
,
27
,
R195
R208
.

Hemani
,
G.
, Zheng. J.,
Elsworth
,
B.
,
Wade
,
K.H.
,
Haberland
,
V.
,
Baird
,
D.
,
Laurin
,
C.
,
Burgess
,
S.
,
Bowden
,
J.
,
Langdon
,
R.
,
Tan
,
V.Y.
,
Yarmolinsky
,
J.
,
Shihab
,
H.A.
,
Timpson
,
N.J.
,
Evans
,
D.M.
,
Relton
,
C.
,
Martin
,
R.M.
,
Davey Smith
,
G.
,
Gaunt
,
T.R.
&
Haycock
,
P.C.
(
2018
)
The MR-Base platform supports systematic causal inference across the human phenome
.
eLife
,
7
,
1
29
.

Kleibergen
,
F.
(
2005
)
Testing parameters in GMM without assuming that they are identified
.
Econometrica
,
73
,
1103
1123
.

Moreira
,
M.J.
(
2003
)
A conditional likelihood ratio test for structural models
.
Econometrica
,
71
,
1027
1048
.

Mounier
,
N.
&
Kutalik
,
Z.
(
2023
)
Bias correction for inverse variance weighting Mendelian randomization
.
Genetic Epidemiology
,
1
,
1
18
.

Newey
,
W.K.
&
Windmeijer
,
F.
(
2009
)
Generalized method of moments with many weak moment conditions
.
Econometrica
,
77
,
687
719
.

Nikpay et al., M.
(
2015
)
A comprehensive 1000 Genomes-based genome-wide association meta-analysis of coronary artery disease
.
Nature Genetics
,
47
,
1121
30
.

Onatski
,
A.
(
2010
)
Determining the number of factors from empirical distribution of eigenvalues
.
Review of Economics and Statistics
,
92
,
1004
1016
.

Sampson
,
A.R.
&
Sill
,
M.W.
(
2005
)
Drop-the-losers design: normal case
.
Biometrical Journal
,
47
,
257
268
.

Schmidt
,
A.F.
,
Hunt
,
N.B.
,
Gordillo-Maranon
,
M.
,
Charoen
,
P.
,
Drenos
,
F.
&
Finan
,
C.
(
2021
)
Cholesteryl Ester Transfer Protein (CETP) as a drug target for cardiovascular disease
.
Nature Communications
,
12
,
1
10
.

Stelzer
,
G.
,
Rosen
,
N.
,
Plaschkes
,
I.
,
Zimmerman
,
S.
,
Twik
,
M.
,
Fishilevich
,
S.
,
Stein
,
T.I.
,
Nudel
,
R.
,
Lieder
,
I.
,
Mazor
,
Y.
,
Kaplan
,
S.
,
Dahary
,
D.
,
Warshawsky
,
D.
,
Guan-Golan
,
Y.
,
Kohn
,
A.
,
Rappaport
,
N.
,
Safran
,
M.
&
Lancet
,
D.
(
2016
)
The GeneCards suite: From gene data mining to disease genome sequence analyses
.
Current Protocols in Bioinformatics
,
54
,
1
30
.

Stock
,
J.H.
,
Wright
,
J.H.
&
Yogo
,
M.
(
2002
)
A survey of weak instruments and weak identification in generalized method of moments
.
Journal of Business and Economic Statistics
,
20
,
518
529
.

Sudlow
,
C.
,
Gallacher
,
J.
,
Allen
,
N.
,
Beral
,
V.
,
Burton
,
P.
&
Danesh
,
J.
(
2015
)
UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age
.
PLoS Medicine
,
12
,
1
10
.

Swerdlow
,
D.I.
,
Kuchenbaecker
,
K.B.
,
Shah
,
S.
,
Sofat
,
R.
,
Holmes
,
M.V.
&
Hingorani
,
A.D.
(
2016
)
Selecting instruments for Mendelian randomization in the wake of genome-wide association studies
.
International Journal of Epidemiology
,
45
,
1600
1616
.

Walker
,
V.M.
,
Smith
,
G.D.
,
Davies
,
N.M.
&
Martin
,
R.M.
(
2017
)
Mendelian randomization: a novel approach for the prediction of adverse drug events and drug repurposing opportunities
.
International Journal of Epidemiology
,
46
,
2078
2089
.

Wang
,
S.
&
Kang
,
H.
(
2022
)
Weak-instrument robust tests in two-sample summary-data Mendelian randomization
.
Biometrics
,
78
,
1699
1713
.

Yang
,
J.
,
Ferreira
,
T.
,
Morris
,
A.P.
,
Medland
,
S.E.
, Genetic Investigation of ANthropometric Traits (GIANT)
Consortium
,
DIAbetes Genetics Replication And Meta-analysis (DIAGRAM) Consortium
,
Madden
,
P.A.
,
Heath
,
A.C.
,
Martin
,
N.G.
,
Montgomery
,
G.W.
,
Weedon
,
M.N.
,
Loos
,
R.J.
,
Frayling
,
T.M.
,
McCarthy
,
M.I.
,
Hirschhorn
,
J.N.
,
Goddard
,
M.E.
&
Visscher
,
P.M.
(
2012
)
Conditional and joint multiple-SNP analysis of GWAS summary statistics identifies additional variants influencing complex traits
.
Nature Genetics
,
44
,
369
375
.

Ye
,
B.T.
,
Shao
,
J.
&
Kang
,
H.
(
2021
)
Debiased inverse-variance weighted estimator in two-sample summary-data Mendelian randomization
.
Annals of Statistics
,
49
,
2079
2100
.

Zhao
,
Q.
,
Wang
,
J.
,
Hemani
,
G.
,
Bowden
,
J.
&
Small
,
D.S.
(
2020
)
Statistical inference in two-sample summary-data Mendelian randomization using robust adjusted profile score
.
Annals of Statistics
,
48
,
1742
1769
.

This is an open access article under the terms of the http://creativecommons.org/licenses/by/4.0/ License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.

Supplementary data